A High-speed MAP Architecture with Optimized Memory Size and Power Consumption

Authors

  • Alexander Worm
  • Holger Lamm
  • Norbert Wehn
Abstract

This paper presents a novel high-speed maximum a posteriori (MAP) decoder architecture with optimized memory size and power consumption. Area and power consumption are both reduced significantly compared to the state of the art. The architecture is also capable of decoding recursive systematic convolutional codes, which are the constituent codes of the revolutionary Turbo-Codes and related concatenation schemes. The architecture is highly scalable with respect to throughput, expanding its applicability over a wide range of throughput requirements (300 Mbit/s–45 Gbit/s and above).

INTRODUCTION

The MAP algorithm is a maximum-likelihood decoding method which minimizes the probability of symbol (or bit) error. In other words, a MAP decoder finds, for each time step, the most likely information bit to have been transmitted given a received noisy or distorted sequence, thus minimizing the bit-error rate (BER). This is unlike a Viterbi decoder [1], which finds the most likely information bit sequence (or code word). Most importantly, the MAP decoder inherently provides a soft output that can be used effectively for decoding concatenated codes. With respect to coding gain, the MAP algorithm is superior to the soft-output Viterbi algorithm (SOVA) [2], which produces qualitatively inferior soft outputs. This is especially important for the latest code concatenation schemes, e.g. Turbo-Codes [3]. The MAP algorithm is therefore an increasingly important building block in present and future communication systems. The significant interest during recent years in high-speed implementations of the Viterbi algorithm can be expected to migrate towards high-speed MAP implementations. Architectural experiments have shown that up to 50–80% of the area cost in (application-specific) architectures for real-time signal processing is due to memory units. The power cost is even more heavily dominated by storage and transfers of complex data types [4].
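To make the symbol-wise versus sequence-wise distinction concrete, here is a toy sketch (my own illustration with hand-picked numbers, not from the paper): given a posterior over four 2-bit sequences, the bit-wise MAP decisions can differ from the single most likely sequence that a Viterbi-style decoder would pick.

```python
# Toy posterior over four 2-bit sequences (hypothetical numbers,
# for illustration only -- not from the paper).
posterior = {
    (0, 0): 0.35,  # most likely single sequence
    (0, 1): 0.10,
    (1, 0): 0.30,
    (1, 1): 0.25,
}

# Viterbi-style decision: pick the single most likely sequence.
ml_sequence = max(posterior, key=posterior.get)

# MAP (symbol-by-symbol) decision: for each bit position, sum the
# posteriors of all sequences carrying a 1 there and compare with 0.5.
def map_bit(k):
    p1 = sum(p for seq, p in posterior.items() if seq[k] == 1)
    return 1 if p1 > 0.5 else 0

map_bits = tuple(map_bit(k) for k in range(2))

print(ml_sequence)  # (0, 0): the single most likely sequence
print(map_bits)     # (1, 0): summed over sequences, bit 0 is more likely a 1
```

Here the sequence-wise and symbol-wise decisions disagree on the first bit, because the sequences starting with 1 are individually less likely but collectively more probable (0.55 vs. 0.45).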
Smaller on-chip memories, if accessed in a first-in first-out (FIFO) manner, are often implemented as register chains, which are power hungry due to the high clock-net load and switching activity. Power consumption depends linearly on both switching rate and capacity and can therefore be reduced very effectively by minimizing FIFO memory size. Thus, memory transfer and size optimization on the architectural level is an essential prerequisite to obtain area- and energy-efficient high-speed decoder implementations. This is especially true for a high-speed MAP decoder, which requires a lot of memory due to the forward-backward structure of the algorithm and due to heavy pipelining.

Footnotes:
1. Patent application filed by Motorola, Inc.
2. Alexander Worm is the recipient of a Motorola Partnerships in Research Grant.

THE MAP ALGORITHM

The MAP algorithm will not be described in detail here (see, e.g., [5, 3, 6]). We state the important results for non-systematic convolutional (NSC) and for recursive systematic convolutional (RSC) codes only. The MAP algorithm computes the log-likelihood ratio (LLR) of the source symbol I_k in step k, conditioned on the knowledge of the received distorted symbol sequence z_0^N = (z_0, z_1, ..., z_k, ..., z_N):

    Λ_k = log( Pr{I_k = 1 | z_0^N} / Pr{I_k = 0 | z_0^N} ).    (1)

Computation of Pr{I_k | z_0^N} is done by determining the probability of reaching a certain encoder state m after having received k symbols z_0^{k-1} = (z_0, z_1, ..., z_{k-1}):

    α_k(m) = Pr{m | z_0^{k-1}}    (2)

and the probability of getting from encoder state m' to the final state in step N with the symbols z_{k+1}^N:

    β_{k+1}(m') = Pr{z_{k+1}^N | m'}.    (3)

The probability of the transition m → m' using the source symbol I_k, under knowledge of the received symbol z_k, is called γ_k:

    γ_k(m, m', I_k) = Pr{m, m', I_k | z_k}.    (4)

The probabilities α_k and β_{k+1} can be found recursively over the γ_k, which are a function of the received symbols and the channel model. Knowing these values for each transition m → m', the probability of having sent the symbol I_k in step k is the sum over all paths using the symbol I_k in step k. With φ(I_k) being the set of all transitions with symbol I_k, we can write:

    Pr{I_k | z_0^N} = Σ_{(m, m') ∈ φ(I_k)} α_k(m) · γ_k(m, m', I_k) · β_{k+1}(m').    (5)

In the case of NSC codes, equation (5) can be written as:

    Pr{I_k | z_0^N} = Σ_{m' ∈ φ̃(I_k)} α_{k+1}(m') · β_{k+1}(m').    (6)

Here, φ̃(I_k) is the set of all states reached by a transition with symbol I_k in step k. It has been pointed out in [7] that it is mandatory to implement the MAP algorithm in the log domain (Log-MAP) to avoid numerical problems without degrading the decoding performance. This simplifies the arithmetic operations: multiplication becomes addition, and addition becomes the max*-operator [7]. The Max-Log-MAP algorithm [7] is obtained by using the approximation max*(ξ_1, ξ_2) ≈ max(ξ_1, ξ_2). Throughout the remainder of this paper, the definitions δ = log α, ε = log β, μ = log γ are introduced for notational convenience.

RELATED WORK

Efficient MAP decoder implementation in custom hardware is a widely unexplored area of research. The most advanced approach was presented by Dawid et al. in [8, 6]; their architecture is denoted in the following as the D-NSC architecture. For high-speed architectures, the frame can be tiled into a number of windows, each of them W steps wide. Each window can be processed independently of the others, thus providing a high degree of parallelism. The data-dependency graph of the D-NSC architecture introduced in [6] for one of these windows is shown on the left side of Figure 1 for W = 6. For reference, the trellis of a constraint length K = 3 code is shown on the top. The trellis of a convolutional code is the unrolled encoder state chart where nodes having the same label are merged. Throughout the remainder of this paper, a code constraint length of K = 3 is assumed where applicable, by way of example only.
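As a minimal sketch of the log-domain arithmetic described above (an illustration, not the paper's hardware implementation), the max*-operator is the Jacobian logarithm max*(a, b) = max(a, b) + log(1 + e^(-|a-b|)), and the Max-Log-MAP approximation simply drops the correction term:

```python
import math

def max_star(a, b):
    """Jacobian logarithm: exact log-domain addition,
    log(e^a + e^b) = max(a, b) + log(1 + e^(-|a-b|))."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_log(a, b):
    """Max-Log-MAP approximation: drops the correction term."""
    return max(a, b)

a, b = 1.0, 1.5
exact = math.log(math.exp(a) + math.exp(b))
print(abs(max_star(a, b) - exact) < 1e-12)  # True: max* is exact
print(max_log(a, b))                        # 1.5, off by log(1 + e^(-0.5))
```

The correction term log(1 + e^(-|a-b|)) lies in (0, log 2], which is why Max-Log-MAP is cheap in hardware but pays a small penalty in decoding performance.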
The trellis steps are numbered from left to right; in the data-dependency graph, time proceeds from top to bottom. The data-dependency graph visualizes the data flow between the arithmetic units. There are different arithmetic units for forward acquisition (explained below), backward acquisition, forward recursion, backward recursion, forward recursion with LLR calculation, and backward recursion with LLR calculation. The data-dependency edges each correspond to several values. For example, 2^{K-1} values correspond to each δ and ε edge of the data-dependency graph; a respective number of values corresponds to each μ edge. The algorithm works as follows: within a window, forward and backward recursion start simultaneously at different trellis steps, k−M and k+M. The branch metrics μ_{k−M}, ..., μ_{k+M−1} are assumed to be pre-computed. Theory shows that the path metrics δ and ε, which are unknown at this stage and set to an arbitrary value, will converge during the first M recursions towards reliable values [8, 6] (M is called the acquisition depth). This phase is called the acquisition phase and, in [6], equals half of the window width W, i.e. W = 2M. Reasonable values for the acquisition length M are determined by the encoder, e.g. M = 10–20 for a constraint length K = 3 code [6]. At the end of the acquisition phase, the path metrics have become reliable and the values can be stored. After another M steps, both recursions have reached the end of the window and are continued in the next one: ε_{k−M} and δ_{k+M} are fed into the left and right neighboring windows, respectively; δ_{k−M} and ε_{k+M} are obtained from these windows, and the recursions continue with concurrent calculation of the outputs Λ_k. As said above, all windows of a frame are processed in parallel; thus no timing problems arise at path metric exchange between adjacent windows. Except for proper initialization of the path metrics, no special arrangements are necessary at either end of a frame.
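The acquisition behavior can be illustrated with a toy sketch (a hypothetical 2-state trellis with fixed, hand-picked branch metrics; not the paper's architecture): forward path metrics started from two different arbitrary initializations converge to the same normalized values after a number of recursion steps, which is what makes independent windowed processing possible.

```python
import math

def max_star(a, b):
    # Log-domain addition (Jacobian logarithm).
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def forward_step(delta, gamma):
    """One forward recursion step on a toy fully connected 2-state
    trellis: delta holds log path metrics per state, gamma[m][m2] the
    log branch metric for the transition m -> m2 (assumed values)."""
    return [
        max_star(delta[0] + gamma[0][m2], delta[1] + gamma[1][m2])
        for m2 in (0, 1)
    ]

def normalize(delta):
    # Subtract the maximum: only metric *differences* carry information.
    m = max(delta)
    return [d - m for d in delta]

# Fixed toy branch metrics (hypothetical, for illustration only).
gamma = [[0.3, -0.2], [-0.1, 0.4]]

# Two very different arbitrary initializations...
a = [0.0, 0.0]
b = [5.0, -5.0]
for _ in range(20):  # M = 20 acquisition steps
    a = normalize(forward_step(a, gamma))
    b = normalize(forward_step(b, gamma))

# ...converge to (nearly) identical normalized metrics.
print(abs(a[1] - b[1]) < 1e-6)  # True
```

The normalization step mirrors what hardware path-metric units do to keep metrics bounded; the convergence of the two runs is the "reliable after M recursions" property the windowed architecture relies on.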
[Figure 1: data-dependency graph of the D-NSC architecture for one window, spanning trellis steps k−M to k+M, with branch metric (μ) inputs and LLR outputs Λ_k.]

Similar articles

Design of Low-Power High-Speed Maximum a Posteriori Decoder Architectures

Future applications demand high-speed maximum a posteriori (MAP) decoders. In this paper, we present an in-depth study of design alternatives for high-speed MAP architectures with special emphasis on low power consumption. We exploit the inherent parallelism of the MAP algorithm to reduce power consumption on various abstraction levels. A fully parameterizable architecture is introduced, which a...

Implementation of Optimized 128-point Pipeline FFT Processor Using Mixed Radix 4-2 for OFDM Applications

Abstract This paper proposes a 128-point FFT processor for Orthogonal Frequency Division Multiplexing (OFDM) systems to process real-time high-speed data, based on a cached-memory architecture (CMA) with the Mixed Radix 4-2 algorithm using the MDC style. The design and implementation of the FFT processor has been done using the above technique to reduce size and power. Using the above alg...

The Local Wavelet Transform: a memory-efficient, high-speed architecture optimized to a Region-Oriented Zero-Tree coder

Abstract The memory required for the implementation of the 2D wavelet transform typically incurs relatively high power consumption and limits the speed performance. In this paper we propose an optimized architecture of the 1D/2D wavelet transform that reduces the memory size cost by one order of magnitude compared to classical implementation styles. This so-called Local Wavelet Transform ...

Image Encryption by Using Combination of DNA Sequence and Lattice Map

In recent years, the advancement of digital technology has led to an increase in data transmission on the Internet. The security of images is one of the biggest concerns of many researchers. Therefore, numerous algorithms have been presented for image encryption. An efficient encryption algorithm should have high security and low search time along with high complexity. DNA encryption is one of the fa...

System-level Exploration of Queuing Management Schemes for Input Queue Packet Switches

In high-performance packet switches, data queuing has to allow real-time memory (de)allocation, buffering, retrieving, and forwarding of incoming packets. Its implementation must be highly optimized to combine high speed, low power, large data storage, and high memory bandwidth. In this paper, such data queuing is used as a case study to demonstrate the effectiveness of a new system-level explo...

Ultra-Low-Energy DSP Processor Design for Many-Core Parallel Applications

Background and Objectives: Digital signal processors are widely used in energy constrained applications in which battery lifetime is a critical concern. Accordingly, designing ultra-low-energy processors is a major concern. In this work and in the first step, we propose a sub-threshold DSP processor. Methods: As our baseline architecture, we use a modified version of an existing ultra-low-power...



Publication date: 2000